I have a dataset (Prime_Vul) for fine-tuning.
The data format (the desired input and output) is as follows:
JSON
{
  "idx": 201849,
  "input": "void sqlite3Pragma( Parse *pParse, Token *pId1, …rest of the code for input }",
  "output": {
    "is_vulnerable": true,
    "vulnerability_types": ["Improper Check for Unusual or Exceptional Conditions"],
    "explanation": "pragma.c in SQLite through 3.30.1 mishandles NOT NULL in an integrity_check PRAGMA command in certain cases of generated columns.",
    "severity_level": "NoInfo",
    "cwe": ["CWE-754"],
    "cve": "CVE-2019-19646"
  },
  "code_token_length": 19000,
  "total_token_length": 20042,
  "max_tokens_setting": 32768
}
My question is: if I want to tokenize this dataset and then use it for fine-tuning, which parts exactly should I tokenize?
All of the data, or just the input?
Please guide me.
You need to tokenize everything that the model will see and learn to predict. The output text must also be tokenized so that it can be fed to the model's decoder as the target sequence during training and contribute to the loss.
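For example, here is a minimal sketch of what that can look like for a decoder-only (causal LM) setup with Hugging Face transformers. The checkpoint name, the prompt template, and the build_example helper are placeholders I am assuming, not something fixed by your dataset; the point is just that the prompt and the serialized output are tokenized together, and only the output tokens keep real labels:

import json
from transformers import AutoTokenizer

# Placeholder checkpoint; use whatever base model you are fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("your-base-model")

def build_example(record):
    # Prompt = the code to analyze; target = the serialized "output" object.
    prompt = ("Analyze the following function for vulnerabilities:\n"
              + record["input"] + "\nAnswer:\n")
    target = json.dumps(record["output"])

    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]

    # The model sees prompt + target, but only the target (and EOS) is scored:
    # -100 is the ignore index used by the transformers loss functions.
    # This assumes the tokenizer defines an EOS token.
    input_ids = prompt_ids + target_ids + [tokenizer.eos_token_id]
    labels = [-100] * len(prompt_ids) + target_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}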
Thank you so much.
Yes, that helped a lot.
But now I have a few more questions; I would appreciate it if you could help me with these too.
1: Should I also tokenize the labels?
2: I should use padding, correct? So that in every batch of the dataset (each with a different maximum token length), the input sequences have the same length for every function the code receives during fine-tuning?
3: Should I also use padding on the output?
4: And if yes, should I pad the input and output separately, or as one sequence?
Again, thank you for the first reply.
I would tokenize the labels you actually need and drop the unnecessary ones. Tokenize both the prompt and the completion, then decide which tokens go where (i.e. which tokens should contribute to the loss). I would use padding, either to a fixed maximum length or to the longest sequence in the batch. And yes, the padded labels must match the shape of your padded inputs.
Something like this:

batch = tokenizer(
    prompts,
    completions,
    padding="longest",
    truncation=True,
    return_tensors="pt",
)
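One variant worth noting: for a causal language model it is common to concatenate the prompt and the completion into a single sequence rather than passing them as a pair, and then to mask the padding out of the loss with -100 so the label tensor has the same padded shape as the inputs. A sketch, continuing from the snippet above and assuming a Hugging Face tokenizer; the max_length value is taken from the max_tokens_setting in your dataset:

# Concatenate each prompt with its completion into one training sequence.
texts = [p + c for p, c in zip(prompts, completions)]

batch = tokenizer(
    texts,
    padding="longest",
    truncation=True,
    max_length=32768,        # matches the dataset's max_tokens_setting
    return_tensors="pt",
)

# Labels mirror the padded input_ids; padded positions are set to -100
# so they do not contribute to the loss.
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100
batch["labels"] = labels

If you also want the prompt tokens excluded from the loss (so the model is only trained to produce the completion), you would additionally set the label positions covering the prompt to -100, as in the earlier sketch.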